Factors and their extent of Impact on House Prices in Saratoga, New York

Aaron Kiss, Dylan Godwin, Nicole Paikkatt, Spencer Goodsall and Wendy Deng

Problem Statement

Home ownership is synonymous with security and wealth.

The global housing market faces increasing fascination and criticism due to rising prices.

With low supply and high demand for housing, rising prices are not slowing and are of increasing concern to our generation.

House prices in Sydney averaged near $27,500 in 1970, worth about $250,000 in today’s prices. Comparatively, the current median value of house prices in Sydney is $1.1 million.

Problem Justification

Why is predicting house prices important?

  1. Knowledge of factors influencing price is a valuable financial literacy skill for young adults in this environment

  2. It informs decisions about property buying or investment and is crucial in helping gauge the fair market value of a property.

It’s an essential skill for university students facing life ahead.

Dataset

Our dataset is based on information collected on houses in Saratoga County, New York, USA in 2006, containing 1734 observations on 17 variables. The Test variable was ignored as its meaning was unknown.

Price = price of the house
Lot.Size = size of the house’s lot in acres
Age = age of the house in years
Land.Value = value of land ($USD)
Living.Area = living area in square feet
Pct.College = percentage of neighbourhood that graduated college
Bedrooms = number of bedrooms
Fireplaces = number of fireplaces
Bathrooms = number of bathrooms
Rooms = number of rooms
Heating.Type = type of heating system
Fuel.Type = type of fuel used for heating
Sewer.Type = type of sewer system
Waterfront = whether property includes waterfront
New.Construction = whether the property is a new construction
Central.Air = whether the house has central air

Data Selection

In our problem, Price was the dependent variable to be predicted, and all other variables were treated as independent variables influencing Price.

Data Processing before Analysis

  • Checked that all columns had valid entries

  • Checked that no column had missing entries
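The two checks above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure actually used in the analysis. The sample rows, the set of numeric columns, the missing-value markers and the rule that negative numbers are invalid are all assumptions made for the example.

```python
# Hypothetical sample rows for illustration; the real dataset has 1734 rows
# and more columns than shown here.
rows = [
    {"Price": "132500", "Lot.Size": "0.09", "Bedrooms": "2", "Waterfront": "No"},
    {"Price": "181115", "Lot.Size": "0.92", "Bedrooms": "3", "Waterfront": "No"},
    {"Price": "", "Lot.Size": "-1", "Bedrooms": "4", "Waterfront": "Yes"},
]

NUMERIC = {"Price", "Lot.Size", "Bedrooms"}  # assumed numeric columns
MISSING = {"", "NA", "NaN"}                  # assumed missing-value markers


def count_missing(rows):
    """Return {column: number of missing entries}."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if value.strip() in MISSING:
                counts[col] += 1
    return counts


def count_invalid(rows):
    """Return {column: number of non-missing numeric entries that do not
    parse as a number or are negative (assumed validity rule)}."""
    counts = {col: 0 for col in NUMERIC}
    for row in rows:
        for col in NUMERIC:
            value = row[col].strip()
            if value in MISSING:
                continue  # missing entries are counted separately
            try:
                if float(value) < 0:
                    counts[col] += 1
            except ValueError:
                counts[col] += 1
    return counts


print(count_missing(rows))
print(count_invalid(rows))
```

A dataset passes both checks when every count is zero.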

Model Selection

To examine the relationship between Price and the properties influencing it, a linear regression analysis was completed. The models we explored included:

  1. Full Model - uses all independent variables in the dataset to predict Price
  2. Log Transformation - takes the log of Price, Land.Value and Living.Area, as these variables had very large values
  3. Stepwise Forward (with log) - begins with no predictors and adds, one at a time, the variables that improve the model’s fit, reducing overfitting
  4. Stepwise Backward (with log) - begins with all predictors and removes variables that do not contribute significantly to the model, giving a simpler and more efficient model
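The forward procedure in item 3 can be sketched in a few lines. This is a simplified illustration, not the code used for the analysis: it fits ordinary least squares through the normal equations, scores models with a Gaussian AIC (up to an additive constant), and runs on synthetic data. The names `log_area`, `log_land` and `junk`, and the numbers used to generate them, are hypothetical.

```python
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))


def aic(rss, n, n_params):
    """Gaussian AIC up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_params


def forward_stepwise(predictors, y):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    n = len(y)
    selected = []
    best = aic(ols_rss([], y), n, 1)  # intercept-only model
    while True:
        scores = {}
        for name in predictors:
            if name in selected:
                continue
            cols = [predictors[s] for s in selected + [name]]
            scores[name] = aic(ols_rss(cols, y), n, len(cols) + 1)
        if not scores:
            break
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        selected.append(name)
        best = scores[name]
    return selected


# Synthetic data: y depends on two predictors; "junk" is pure noise.
rng = random.Random(0)
n_obs = 200
predictors = {
    "log_area": [rng.uniform(6, 8) for _ in range(n_obs)],
    "log_land": [rng.uniform(9, 12) for _ in range(n_obs)],
    "junk": [rng.gauss(0, 1) for _ in range(n_obs)],
}
y = [7.0 + 0.5 * a + 0.15 * b + rng.gauss(0, 0.1)
     for a, b in zip(predictors["log_area"], predictors["log_land"])]

selected = forward_stepwise(predictors, y)
print(selected)
```

Backward selection follows the same pattern in reverse: start from all predictors and drop the one whose removal lowers AIC most, until no removal helps.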

Model Performance Summary

R squared - proportion of variance in the dependent variable that is explained by the independent variables (in-sample)

Adjusted R Squared - R squared adjusted for the number of independent variables, to address overfitting (in-sample)

RMSE - square root of the mean squared error/residual between predicted and observed values (out-of-sample)

MAE - mean absolute error/residual between predicted and observed values (out-of-sample)
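The four metrics defined above can be written out directly. The sketch below uses the standard textbook formulas; the observed and predicted values are hypothetical and are not taken from the Saratoga data.

```python
import math


def r_squared(y, y_hat):
    """1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Penalise R squared for k predictors (intercept excluded), n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def rmse(y, y_hat):
    """Square root of the mean squared residual."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute residual."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


# Hypothetical observed and predicted values, for illustration only.
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 280.0]

r2 = r_squared(observed, predicted)
print(r2, adjusted_r_squared(r2, n=4, k=1))
print(rmse(observed, predicted), mae(observed, predicted))
```

For any set of residuals, MAE is never larger than RMSE, which gives a quick sanity check when the two are reported side by side.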

Model               \(R^2\)    Adjusted \(R^2\)    RMSE        MAE
Full Model          0.6553     0.6509              58014.45    415380.38
Log Transform       0.5941     0.5889              0.2915      0.2077
Stepwise Forward    0.5919     0.5889              0.2926      0.2077
Stepwise Backward   0.5935     0.5897              0.2927      0.2080

Visualisation of Out-of-Sample Performance

No visualisation was created for the Full Model because a log transformation is needed for the model to meet the regression assumptions. Additionally, the Full Model's errors are on the original price scale, so they are not directly comparable with those of the logged models.

Full Model Assumption Check (1)

Linearity - Plots of the dependent variable (Price) against each independent variable show that conformity to linearity is evident in only some plots

Independence - Used the Durbin-Watson test to check for autocorrelation. For the Full Model, the DW value was 1.6595, which is reasonably close to 2 and was taken to indicate that the Full Model meets the independence assumption
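For reference, the Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 suggesting no first-order autocorrelation. The sketch below computes it on two hypothetical residual series; it illustrates the formula only and is not the code used for the reported values.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residual series, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]           # sign flips every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # long runs of the same sign

print(durbin_watson(alternating))  # above 2: negative autocorrelation
print(durbin_watson(trending))     # below 2: positive autocorrelation
```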

Full Model Assumption Check (2)

Homoskedasticity - the residuals spread out into a funnel shape, violating homoskedasticity

Normality - there is a large spike in the residuals near the top end of the data and a drop in the tail, meaning that the extremes of the data do not conform to the normality assumption

Log Transformation Model Assumption Check (1)

Linearity - Conformity to linearity is far more consistent across the independent variables after the log transformation

Independence - the DW value of 1.5627 indicates that the independence assumption is met

Log Transformation Model Assumption Check (2)

Homoskedasticity - the funnel shape is less pronounced, meaning the residuals do not spread out as much and display a more constant pattern

Normality - residuals look closer to normal than in the Full Model, although the ends are still slightly off the line

Stepwise Model Assumption Check (1)

As the Stepwise Models use the log transformation, the assumption checks gave results similar to those of the Log Transformation Model. The following is an example from the Stepwise Forward model.

Linearity - Conformity to linearity is far more consistent across the independent variables

Independence - the DW value of 1.5672 indicates that the independence assumption is met

Stepwise Model Assumption Check (2)

Homoskedasticity and Normality observations were similar to those of the Log Transformation Model.

Final Model Interpretation

For our final model we chose the Stepwise Forward model. Although the Full Model performed slightly better on \(R^2\), the Stepwise Forward model did not violate the assumptions to the same degree as the Full Model. Multicollinearity was found in the Stepwise Backward model and the full Log Transformation model, but not in the Stepwise Forward model.

Stepwise Forward Model:

\(\log(\text{Price}) = 6.85 + 0.51\,\log(\text{Living.Area}) + 0.13\,\log(\text{Land.Value}) + 0.11\,\text{Bathrooms}\)
\({} + 0.53\,\text{Waterfront} + 0.08\,\text{Heating.Type (Hot Air)} + 0.06\,\text{Heating.Type (Hot Water)} - 0.35\,\text{Heating.Type (None)}\)
\({} + 0.04\,\text{Lot.Size} - 0.001\,\text{Age} - 0.11\,\text{New.Construction} - 0.002\,\text{Pct.College} + 0.04\,\text{Central.Air} + 0.01\,\text{Rooms}\)
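Because the response is logged, the coefficients are easier to read as percentage effects on Price. The sketch below shows the usual conversion, assuming the logs are natural logs (the slides do not state the base) and holding the other predictors fixed. The helper function names are hypothetical.

```python
import math


def pct_change_log_predictor(beta, pct_increase):
    """Percentage change in Price when a logged predictor rises by pct_increase %."""
    return ((1 + pct_increase / 100) ** beta - 1) * 100


def pct_change_level_predictor(beta, delta=1.0):
    """Percentage change in Price when an unlogged predictor rises by delta units."""
    return (math.exp(beta * delta) - 1) * 100


# Coefficients taken from the Stepwise Forward equation above.
living_area_effect = pct_change_log_predictor(0.51, 10)  # 10% more living area
waterfront_effect = pct_change_level_predictor(0.53)     # waterfront vs none
age_effect = pct_change_level_predictor(-0.001, 10)      # house 10 years older

print(round(living_area_effect, 2))
print(round(waterfront_effect, 2))
print(round(age_effect, 2))
```

Under these assumptions, a 10% larger living area corresponds to roughly 5% higher predicted Price, a waterfront property to about 70% higher, and ten additional years of age to about 1% lower.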

Final Model Visualisation

Intercept = 6.85

Interaction Plot

The Stepwise Forward Model was the better Price prediction model, as it chose stable predictors that are more relevant for explaining the variation in Price. This is illustrated by the following sample interaction plot of Rooms vs Lot.Size.

Limitations

  • Assumptions not precisely met: the normal Q-Q plot shows outliers and deviations in the tails, but with 1734 observations (well above 30) the central limit theorem supports approximate normality of the coefficient estimates

  • Dependence on AIC: stepwise selection adds or removes variables based on AIC and does not consider all possible combinations of predictors, so it may miss the optimal model

Future Research and Analysis

  • Prediction of Price per square foot, which is often regarded as a better metric for determining property desirability and quality

  • Further neighborhood, demographic information and occupant status (renters or owners) for more accurate analysis

  • Apply our research to an additional local area to determine the relevance of our model

Q & A